Netflix Null Values¶



Table of Contents

  • 1  Introduction
  • 2  Goal
  • 3  Install Libraries
    • 3.1  pyspark
    • 3.2  wordcloud
  • 4  Setting Display configurations
    • 4.1  Setting itable options
  • 5  Importing Data
    • 5.1  Spark
      • 5.1.1  Why is it important to use options?
    • 5.2  Pandas
      • 5.2.1  date type
    • 5.3  dates format
  • 6  Exploratory Data
    • 6.1  Data info
      • 6.1.1  Spark
      • 6.1.2  Pandas
  • 7  Netflix Data Null Values
    • 7.1  Spark Nulls
      • 7.1.1  Converting values to a tidy table: why it matters
      • 7.1.2  Data nulls results
    • 7.2  Pandas Nulls
      • 7.2.1  Data nulls results
  • 8  Nulls Data Merge
    • 8.1  Plotly templates
    • 8.2  Bar Plot
    • 8.3  Line Plot
  • 9  Comparing Pandas and Spark Data Processing: Key Findings and Insights
    • 9.1  Line Plot
    • 9.2  Word Cloud
  • 10  Impute Null Values
    • 10.1  Nulls Values Matrix
    • 10.2  Exploring and Cleaning
    • 10.3  No Null values
    • 10.4  Nulls Values Matrix
  • 11  Export to html

Introduction¶


The problem of missing values in data is a common issue that can arise during the data collection and preprocessing stages. Missing values can occur for a variety of reasons, such as errors in data entry or measurement, or the intentional withholding of information. These missing values can have a significant impact on the quality and accuracy of any analysis performed on the data.

One of the most important steps in dealing with missing data is to handle it appropriately before any analysis is performed. This is where imputation comes in. Imputation is the process of replacing missing values with estimated ones. The goal of imputation is to minimize the loss of information and bias in the data while maximizing the usefulness of the remaining information.

It is important to address missing values before any analysis because they can lead to biased or incorrect results. For example, if the missing data is not handled properly, it can lead to a biased estimate of the mean or standard deviation. Additionally, many statistical models and machine learning algorithms cannot handle missing data and will either fail or produce inaccurate results.

Overall, imputing missing values is crucial to ensure the validity and accuracy of any analysis performed on the data.

  • Missing values can occur for a variety of reasons, such as errors in data entry or measurement, or the intentional withholding of information.

  • Imputation is the process of replacing missing values with estimated ones.

  • The goal of imputation is to minimize the loss of information and bias in the data while maximizing the usefulness of the remaining information.
  • Handling missing values before any analysis is performed is crucial to ensure the validity and accuracy of any analysis performed on the data.
  • Many statistical models and machine learning algorithms cannot handle missing data and will either fail or produce inaccurate results.
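The idea of imputation summarized above can be shown in a few lines. This is a minimal sketch with hypothetical values (not from the Netflix dataset), using mean imputation, one of the simplest strategies:

```python
import pandas as pd

# Toy frame with a gap in a numeric column (hypothetical data).
df = pd.DataFrame({"duration_min": [90.0, 94.0, None, 110.0]})

# Mean imputation: replace the missing value with the column mean.
# .mean() skips NaN, so the estimate is (90 + 94 + 110) / 3 = 98.0.
mean_value = df["duration_min"].mean()
df["duration_min"] = df["duration_min"].fillna(mean_value)

print(df["duration_min"].tolist())  # [90.0, 94.0, 98.0, 110.0]
```

Note how the imputed value depends entirely on the observed data, which is why missingness that is not random can bias such estimates.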

Goal¶


The goal of this notebook is to perform an in-depth analysis of the imputation of missing values using both Spark and Pandas, two widely used data processing libraries. The analysis will include a comparison of the different methods available for imputation in each library, and the robustness of each method will be evaluated using appropriate statistical measures. The notebook will also demonstrate the implementation of these methods in a real-world scenario.

The specific objectives of this notebook are:

  1. To introduce the problem of missing values in data and explain why it is important to address it before any analysis is performed.

  2. To present an overview of the different methods of imputation available in Spark and Pandas.

  3. To evaluate the robustness of each imputation method using appropriate statistical measures.

  4. To demonstrate the implementation of the imputation methods in a real-world scenario.

  5. To compare the results obtained from the methods in Spark and Pandas and draw conclusions about which library is more suitable for imputing missing values in a given context.

Overall, this notebook aims to provide a comprehensive understanding of the imputation of missing values in data using Spark and Pandas, and help data scientists and analysts make informed decisions when dealing with missing data.

Install Libraries¶


During the process of handling missing data, it is important to install any necessary libraries that are not already present in Databricks. As each library is added, a short description and its key aspects are included in the notebook to ensure proper use and understanding. This step also guarantees that all the functions and tools needed in the subsequent parts of the notebook are available.

pyspark¶

PySpark is the Python API for Apache Spark, a fast and general engine for large-scale data processing. It is designed to provide a simple and easy-to-use programming interface for parallel computing on clusters of machines. PySpark is built on top of Spark's Java API and exposes the Spark programming model to Python.

In [1]:
#!pip install pyspark

wordcloud¶

The wordcloud library will be used in this notebook to create visual representations of the data, specifically to compare the missing values using pandas and Spark. This library allows for the creation of word clouds from text data, where the size of each word represents its frequency in the text. This can be useful for identifying patterns and trends in the missing values, and can also be used for exploratory data analysis.

In [2]:
#!pip install wordcloud

Setting Display configurations¶


In [3]:
import plotly.express._doc as xpdocs
import matplotlib.pyplot as plt

Setting itable options¶

In [4]:
from itables import init_notebook_mode
import itables.options as opt
init_notebook_mode(all_interactive=True)
In [5]:
%config InlineBackend.figure_format = 'retina'
In [6]:
opt.column_filters= 'footer'
opt.classes = 'display nowrap cell-border'
opt.dom = 'lftipr'
opt.search = {"regex": True, "caseInsensitive": True, "smart":True, 'highlight':True}
opt.paging = True
opt.autoWidth=False
opt.showIndex = False
opt.columnDefs=[{"width": "120px", "targets": "_all"}]

Importing Data¶


Now, we will be importing the data using both Spark and Pandas. It is important to note that both of these libraries have their own unique advantages and disadvantages when it comes to handling large datasets. Spark is highly efficient and can handle large amounts of data with ease, but it can be more difficult to work with than Pandas. On the other hand, Pandas is relatively easy to use and is great for data manipulation and cleaning, but it may struggle with very large datasets. Therefore, it is important to consider the size and complexity of the dataset when deciding which library to use for data import. Additionally, it is also important to keep in mind that the specific use case and desired outcome of the analysis may also play a role in determining which library is best suited for the task at hand.

Spark¶

PySpark is the Python library for Spark programming that allows for easy and efficient processing of large amounts of data using the power of the Apache Spark engine. One of the main advantages of using PySpark in Databricks is its ability to scale up and distribute computations across multiple machines, which can greatly improve the performance of big data processing tasks. One thing to keep in mind is that Spark's CSV reader requires explicit options and, ideally, a schema, so the import code is longer than a single pd.read_csv call in the Pandas library.

In [7]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import SparkSession
#Create Spark Session
spark = SparkSession.builder.getOrCreate()

# File location and type
file_location = "netflix_titles.csv"

# Define the schema
schema = StructType([
    StructField("show_id", IntegerType()),
    StructField("type", StringType()),
    StructField("title", StringType()),
    StructField("director", StringType()),
    StructField("cast", StringType()),
    StructField("country", StringType()),
    StructField("date_added", DateType()),
    StructField("release_year", IntegerType()),
    StructField("rating", StringType()),
    StructField("duration", StringType()),
    StructField("listed_in", StringType())
])

# The applied options are for CSV files. For other file types, these will be ignored.
spark_data = spark.read.format('csv') \
                  .option("inferSchema", 'false') \
                  .option("header", 'true') \
                  .option("dateFormat", "MMMM d, yyyy")\
                  .schema(schema) \
                  .load(file_location)

.option() is a method provided by the Spark DataFrame API that allows you to specify options when reading in data. These options can include things like the file format (e.g. CSV), whether to infer the schema automatically, and how to handle column names. The .option() method takes two arguments: the first is the name of the option, and the second is the value for that option. The options specified with .option() will be used when the .load() method is called to read in the data. Some typical options are:

  • inferSchema: By default, Spark will try to infer the schema of the data automatically when reading in a CSV file. This can be slow and can also result in unexpected data types. Setting this option to 'false' will prevent Spark from trying to infer the schema and instead use the schema that is specified in the next option, 'schema'.

  • header: By default, Spark assumes that the first row of the CSV contains the column names. Setting this option to 'true' will tell Spark to use the first row as the column names. If this is set to 'false', Spark will not use the first row as the column names and will instead generate default column names like "col1", "col2", etc.

  • dateFormat: By default, Spark parses dates in the format "yyyy-MM-dd"; this option lets you specify the actual format of your date field.

  • schema: This option allows you to specify the schema of the data explicitly. This can be useful if you know the schema ahead of time and want to avoid the overhead of inferring it automatically.

  • load: Strictly speaking this is a method rather than an option; it specifies the path of the CSV file you want to import.

Using these options when importing a CSV file in Spark can help ensure that your data is read in correctly and with the desired schema, and can improve performance by avoiding the overhead of inferring the schema automatically.

In [8]:
opt.maxBytes = spark_data.toPandas().memory_usage().sum()
In [9]:
display(spark_data.toPandas())
[Interactive table output: spark_data.toPandas() — columns show_id, type, title, director, cast, country, date_added, release_year, rating, duration, listed_in; 'date_added' parsed to dates such as 2019-09-09. Showing 1 to 10 of 6,236 entries.]

Why is it important to use options?¶

By using the options in the import statement, we ensure that the dataframe is in the correct format according to the schema specified. For example, by setting the "inferSchema" option to 'false' we are telling Spark to not try to infer the schema automatically and instead use the schema specified in the "schema" option. This ensures that the data is correctly mapped to the desired column types.

The "dateFormat" option is important because it allows you to specify the format of the date field in the CSV file. Without this option, Spark will assume the default format "yyyy-MM-dd" and if the date in CSV is not in this format it will convert it to null. In this case the format to 'date_added' column was 'September 9, 2019' so for coorrect read the format was 'MMMM d,yyyy' By specifying the correct date format, Spark will correctly parse the date field and map it to the correct column in the dataframe.

Overall, these options allow us to have more control over the format of the dataframe and make sure that it is in the correct format according to the schema.
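This failure mode is easy to reproduce without a Spark cluster. A minimal pandas sketch with hypothetical date strings in the same 'September 9, 2019' style (strftime's "%B %d, %Y" playing the role of Spark's "MMMM d, yyyy"):

```python
import pandas as pd

dates = pd.Series(["September 9, 2019", "September 8, 2018"])

# Wrong format: nothing matches an ISO "yyyy-MM-dd"-style pattern, so with
# errors="coerce" every value becomes NaT -- the pandas analogue of Spark
# silently producing nulls when dateFormat does not match the data.
bad = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(int(bad.isna().sum()))   # 2

# Correct format: all values parse, no nulls.
good = pd.to_datetime(dates, format="%B %d, %Y")
print(int(good.isna().sum()))  # 0
```

The same mismatch in Spark would not raise an error, which is why checking null counts right after import is a good habit.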

Pandas¶

Using pandas for data import is a simpler process compared to PySpark. The read_csv() function allows for easy import of csv files, even when they are stored in a public repository and accessed via a URL. One major advantage of using pandas is that it does not require the creation of a schema for the variables. This can save time and effort when working with large datasets. However, it's important to keep in mind that Pandas is not as efficient as PySpark when working with big data. So, if you are dealing with large datasets, PySpark is the best option.

In [10]:
import pandas as pd

url = 'netflix_titles.csv'
pandas_data = pd.read_csv(url, parse_dates=['date_added', 'release_year'])
In [11]:
#pandas_data['date_added'] = pd.to_datetime(pandas_data.date_added, format='%B %d, %Y').dt.date
pandas_data['release_year'] = pd.to_datetime(pandas_data.release_year).dt.year

date type¶

It is important to have the correct date format in pandas because it allows for accurate manipulation and analysis of the data; inaccurate date formats can lead to errors or misinterpretation. By applying the pd.to_datetime() function to the 'date_added' column, we ensure that it is correctly formatted as a date, which allows for accurate analysis based on specific dates. Similarly, converting the 'release_year' column with pd.to_datetime() and then extracting only the year component ensures that it is stored as an integer, which allows for accurate analysis based on specific years.
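The two conversions can be seen in isolation on hypothetical values shaped like the two columns:

```python
import pandas as pd

# Values shaped like 'date_added' and 'release_year' (hypothetical).
added = pd.Series(["September 9, 2019", "January 1, 2020"])
years = pd.Series(["2019", "2020"])

# Parse the full date with an explicit format string.
added_dt = pd.to_datetime(added, format="%B %d, %Y")
print(added_dt.dt.year.tolist())  # [2019, 2020]

# Parse, then keep only the integer year component via the .dt accessor.
year_int = pd.to_datetime(years).dt.year
print(year_int.tolist())          # [2019, 2020]
```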

In [12]:
opt.maxBytes = pandas_data.memory_usage().sum()
pandas_data
Out[12]:
[Interactive table output: pandas_data — the same columns as the Spark dataframe plus description; 'date_added' still shown as strings such as "September 9, 2019". Showing 1 to 10 of 6,234 entries.]
In [13]:
pandas_data.columns
Out[13]:
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

dates format¶

When reading in a data frame, it is important to make sure that the data is in the correct format for analysis. This is particularly important for columns that represent dates and times, such as the 'date_added' and 'release_year' columns.

In the case of the 'date_added' column, it is formatted as '2019-09-09'. This is a standard format for dates, with the year, month, and day separated by dashes. This format is easily readable and can be easily used for temporal analysis.

The 'release_year' column is also in the correct format, with the year formatted as an integer. This format is useful when performing statistical analysis by year. It is important that the column is in this format so that it can be used for groupby and aggregation operations.
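As a quick illustration of why the integer format matters, here is a minimal groupby-and-aggregate sketch on hypothetical rows:

```python
import pandas as pd

# Toy frame with an integer release_year column (hypothetical durations).
df = pd.DataFrame({
    "release_year": [2016, 2017, 2017, 2019],
    "duration_min": [94, 99, 60, 90],
})

# Integer years group and aggregate cleanly: mean duration per year.
per_year = df.groupby("release_year")["duration_min"].mean()
print(per_year.to_dict())  # {2016: 94.0, 2017: 79.5, 2019: 90.0}
```

If 'release_year' were stored as strings or full timestamps, this kind of per-year aggregation would need an extra conversion step first.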

Exploratory Data¶


Data info¶

Spark¶

The printSchema() method is a useful tool when working with DataFrames in Spark as it allows you to quickly view the data types and names of columns in a DataFrame. This method displays the schema of the DataFrame in a tree format, with each level of the tree representing a level of nested fields in the schema. By using this method, you can easily identify any issues or errors in the schema, such as incorrect data types. Additionally, it can help you understand the structure of your data and make any necessary adjustments before performing further analysis or operations on the DataFrame.

In [14]:
spark_data.printSchema()
root
 |-- show_id: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: date (nullable = true)
 |-- release_year: integer (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)

According to the schema above, one adjustment is needed to improve the data: the 'show_id' column is read as an integer, but it should be treated as a string (the Spark analogue of Pandas' object dtype) so that it is excluded from numeric statistics.

When a column is defined as an integer, it is considered as a numeric column and it can be used in mathematical operations such as sum, average, standard deviation, etc. However, the 'show_id' column is not a numeric value, it is a unique identifier for each show, so it should not be included in mathematical operations and it should not be used as a measure in any statistical analysis.

Therefore, by converting the 'show_id' column from integer to object, we ensure that this column is treated as a string or categorical data, which is the correct data type for this column. This will prevent any errors or inaccuracies in the analysis of the data and will make the data more meaningful.

In summary, converting the 'show_id' column from integer to object is important to maintain the integrity of the data, and to prevent any errors or inaccuracies in the analysis of the data.

Pandas¶

The info() method in the Python library pandas is used to get a summary of the dataframe, including the name of the columns, their data types, and the number of non-null values.

When you use the info() method on a dataframe, it will display the following information:

  • The class of the object (e.g. pandas.core.frame.DataFrame)
  • The number of rows and columns
  • The name of each column and its data type
  • The number of non-null values for each column
  • The memory usage of the dataframe
In [15]:
pandas_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6234 non-null   int64 
 1   type          6234 non-null   object
 2   title         6234 non-null   object
 3   director      4265 non-null   object
 4   cast          5664 non-null   object
 5   country       5758 non-null   object
 6   date_added    6223 non-null   object
 7   release_year  6234 non-null   int32 
 8   rating        6224 non-null   object
 9   duration      6234 non-null   object
 10  listed_in     6234 non-null   object
 11  description   6234 non-null   object
dtypes: int32(1), int64(1), object(10)
memory usage: 560.2+ KB

As you can see, the 'show_id' column is currently read as an integer. However, it is important to note that this column serves as a unique identifier for each show and should not be treated as a numerical value for statistical analysis.

In [16]:
pandas_data['show_id'] = pandas_data.show_id.astype('object')

Converting the unique identifier, such as the 'show_id' column, to an object type ensures that it will not be included in any statistical calculations, preserving the integrity of the data and avoiding any inaccuracies in the results. This approach is similar to how it is handled in Spark, where the 'show_id' is treated as an object, ensuring that it is not included in any statistics.
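The effect of the cast can be checked on a toy frame: once the identifier has object dtype, it drops out of numeric summaries automatically.

```python
import pandas as pd

# Toy frame: an id stored as object next to a true numeric column
# (hypothetical values in the shape of show_id / release_year).
df = pd.DataFrame({
    "show_id": pd.Series([81145628, 80117401], dtype="object"),
    "release_year": [2019, 2016],
})

# describe() on a mixed-dtype frame summarizes numeric columns only,
# so the identifier no longer distorts means, std, etc.
print(df.describe().columns.tolist())  # ['release_year']
```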

Netflix Data Null Values¶


Spark Nulls¶

The select method is used in Spark to select specific columns or to perform operations on the columns and create new columns. In this case, it is used to perform a count of the null values in each column of the DataFrame spark_data by using the count function and the when expression to count only the null values. The alias method is used to give the new column the same name as the original column.

Converting the resulting DataFrame to a Pandas DataFrame with the toPandas() method gives access to Pandas' more convenient manipulation tools, such as the sort_values and round methods. This makes it easy to sort the DataFrame by the total number of null values in descending order and round the percentages to 4 decimal places. Pandas also integrates with many libraries, such as matplotlib and seaborn, that do not operate directly on Spark DataFrames.

It is worth noting that converting the DataFrame to Pandas is not always the best option when working with large datasets, as it can cause performance issues due to the need to transfer the data from the Spark cluster to the local machine. However, if the dataset is small enough to fit in memory, converting to Pandas can be a useful tool for data manipulation and analysis.

In [17]:
spark_data_nulls = spark_data.select([count(when(col(c).isNull(), c)).alias(c) for c in spark_data.columns])
spark_nulls = spark_data_nulls.toPandas()
missing_spark_df = pd.DataFrame()
missing_spark_df['category'] = spark_nulls.columns
missing_spark_df['total'] = spark_nulls.iloc[0].values
missing_spark_df['percentage'] = missing_spark_df['total']/ spark_data.count()
missing_spark_df = missing_spark_df.sort_values('total', ascending=0).round(4)

Converting values to a tidy table: why it matters¶

Converting the total counts and percentages of null values in a DataFrame to a separate DataFrame is convenient for several reasons. One of the main benefits is the ability to easily visualize the data. By having the null values data in a separate DataFrame, it makes it simpler to create graphs and charts that show the distribution of null values across the columns. This can be useful for identifying patterns in the data and identifying which columns have a high number of null values, which can be useful in the data cleaning process.

Another advantage is that it facilitates data analysis. By having the null values data in a separate DataFrame, it makes it easier to analyze the data and make decisions about how to handle missing values. For example, you can use the percentage of null values to decide whether to drop a column, impute the missing values, or use some other strategy.

Additionally, having the null values data in a separate DataFrame also makes it easier to compare different datasets. You can compare the null values of two datasets to see how they differ and identify if one dataset is missing more data than the other.
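One way such a tidy table feeds decisions is a simple threshold rule. This sketch assumes a hypothetical 10% cutoff (the threshold is a choice, not a rule of the notebook) applied to a summary in the same shape as missing_spark_df:

```python
import pandas as pd

# Tidy null-summary table in the shape the notebook builds
# (hypothetical totals copied in the same style).
summary = pd.DataFrame({
    "category":   ["director", "cast", "rating"],
    "percentage": [0.3161, 0.0916, 0.0019],
})

# Flag columns whose null share exceeds the assumed 10% threshold
# for closer review before choosing an imputation strategy.
flagged = summary.loc[summary["percentage"] > 0.10, "category"].tolist()
print(flagged)  # ['director']
```

Because the summary is a plain DataFrame, the same one-liner works whether the counts came from Spark or from Pandas.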

In [18]:
opt.paging=False
display(missing_spark_df)
category     | total | percentage
director     | 1971  | 0.3161
date_added   | 661   | 0.1060
cast         | 571   | 0.0916
country      | 478   | 0.0767
rating       | 12    | 0.0019
release_year | 10    | 0.0016
listed_in    | 3     | 0.0005
show_id      | 2     | 0.0003
title        | 2     | 0.0003
duration     | 2     | 0.0003
type         | 1     | 0.0002
Showing 1 to 11 of 11 entries

Data nulls results¶

Based on the results of the null values DataFrame generated with Spark, several columns have a significant number of missing values: 'director' (31.6%), 'date_added' (10.6%), 'cast' (9.2%), and 'country' (7.7%) each have missing values at a rate of more than 7%, while 'rating' and the remaining columns are each below 0.2%.

This information can be used to make decisions about how to handle these missing values. There are several options for dealing with missing data, including dropping the columns, imputing the missing values, or using some other strategy.
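The two main strategies mentioned can be contrasted on a toy frame (hypothetical rows shaped like the 'director' and 'rating' columns):

```python
import pandas as pd

df = pd.DataFrame({
    "director": [None, "Gabe Ibáñez", None],
    "rating":   ["TV-MA", "R", None],
})

# Strategy 1: drop every row containing any missing value
# (here two of the three rows are lost).
print(len(df.dropna()))           # 1

# Strategy 2: impute with a placeholder category instead of dropping,
# preserving all rows at the cost of an artificial "Unknown" level.
filled = df.fillna({"director": "Unknown", "rating": "Not Rated"})
print(int(filled.isna().sum().sum()))  # 0
```

For a column like 'director' with over 30% nulls, dropping rows would discard a large share of the dataset, which is why imputation is usually preferred here.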

Bar Plot¶
In [19]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot 
init_notebook_mode(connected=False)
import plotly.express as px

scale = 3/4

reds = ['salmon','darksalmon','red','crimson','darkred']
reds2 = ['red', 'crimson', 'darkred']
item1 = px.bar(missing_spark_df, y='category', x='total', orientation='h',color= 'category', color_discrete_sequence=['red'], template='plotly_dark', height=625)

def update_style(item, title,size, h, w):
    fig = item.update_layout(height=h, width= w, title={'text': title, 'x':0.5, 'xanchor':'center', 'yanchor': 'top', 'font':{'size':size, 'color':'red', 'family':'Balto'}}, xaxis_showgrid=False, yaxis_showgrid=False, #plot_bgcolor='rgba(0, 0, 0, 0)', paper_bgcolor='rgba(0, 0, 0, 0)', 
                             template= 'plotly_dark')
    return fig
item1= update_style(item1, 'Spark null values', 40*scale, 600*scale, 1200*scale)
item1.show()
[Bar plot output: "Spark null values" — horizontal bars of total nulls per category, from 'type' (1) up to 'director' (1971).]
Pie Plot¶
In [ ]:
item2 = px.pie(missing_spark_df, values='total', names='category', hole=0.4, template='plotly_dark', color_discrete_sequence=reds[::-1])
item2 = update_style(item2, 'Total Null values percentage',40*scale, 600*scale, 1200*scale)
item2.show()

The graphs show the total number of missing values in each column, divided into categories: 'director', 'date_added', 'cast', 'country', 'rating', 'release_year', 'listed_in', 'show_id', 'title', 'duration', and 'type'. The bar chart displays the total number of missing values per category, and the pie chart displays each category's share of the total. The column with the most missing values is 'director' with 1971, followed by 'date_added' with 661; the fewest are in 'show_id', 'title', and 'duration' with only 2 each, and 'type' with just 1. The pie chart shows that the 'director' column accounts for the largest share of missing values, 53.1% of the total.

Word Cloud¶
In [20]:
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from wordcloud import WordCloud

cmap = LinearSegmentedColormap.from_list("", ['#f00','#7a1b0c'])
spark_words = [(w, f) for w,f in zip(missing_spark_df.category, missing_spark_df.total)]
wordcloud = WordCloud(width = 1440, height = 190,
                background_color ='black',
                mode = "RGBA",
                colormap = cmap,
                min_font_size = 9,
                relative_scaling = 0.01,
                #stopwords = STOPWORDS,
                )
wordcloud.generate_from_frequencies(dict(spark_words))
# show words cloud
plt.figure(figsize = (10, 10), dpi= 300, facecolor = None)
plt.imshow(wordcloud)
plt.margins(x=0, y=0)
plt.axis("off")
plt.show()

Pandas Nulls¶

In pandas, the isna() method checks for missing values in a DataFrame or Series. It returns a Boolean mask indicating whether each element is missing (True) or present (False). Chained with the sum() method, it returns the number of missing values in each column.
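A minimal illustration of that chaining on a toy frame (not the Netflix data):

```python
import pandas as pd

# Toy frame with two missing directors.
df = pd.DataFrame({"director": ["A", None, None],
                   "title": ["x", "y", "z"]})

mask = df.isna()          # Boolean mask, True where a cell is missing
counts = df.isna().sum()  # missing values per column

print(counts["director"])  # 2
print(counts["title"])     # 0
```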

In [21]:
bool_na = pandas_data.isna().sum() > 0
total = pandas_data.isna().sum()[bool_na]
missing_pandas_df= pd.DataFrame()
missing_pandas_df['category'] = total.index
missing_pandas_df['total'] = total.values
missing_pandas_df['percentage'] = total.values/ pandas_data.shape[0]
missing_pandas_df = missing_pandas_df.round(4)

Creating a "tidy" table is an important step when working with data: it makes generating interactive charts simpler and more efficient.

In [22]:
display(missing_pandas_df)
category      total  percentage
director       1969      0.3158
cast            570      0.0914
country         476      0.0764
date_added       11      0.0018
rating           10      0.0016

Data nulls results¶

From these results, it can be concluded that the columns 'director', 'cast' and 'country' have the highest numbers of missing values, while 'date_added' and 'rating' have the lowest. Overall, the dataset has a moderate amount of missing values.

Bar Plot¶
In [23]:
item3 = px.bar(missing_pandas_df, y='category', x='total', orientation='h', color= 'category', color_discrete_sequence=['red'], template='plotly_dark')
item3 = update_style(item3, 'Pandas null values', 40*scale, 600*scale, 1200*scale)
item3.show()
[Bar chart: 'Pandas null values', total null count per category]
Pie Plot¶
In [24]:
item4 = px.pie(missing_pandas_df, values='total', names='category', hole=.4, opacity=.7, color_discrete_sequence= reds[::-1], template='plotly_dark')
item4 = update_style(item4, 'Total Null values percentage', 40*scale, 600*scale, 1200*scale)
item4.show()
[Pie chart: 'Total Null values percentage', director 64.9%, cast 18.8%, country 15.7%, date_added 0.362%, rating 0.329%]

The comparison of the total missing values between Spark and Pandas shows that Pandas has fewer missing values in the columns "director", "cast", "country", and "rating". The "director" column has 1969 missing values in Pandas versus 1971 in Spark; "cast" has 570 versus 571; and "country" has 476 versus 478. The most striking result is the "date_added" column, with 11 missing values in Pandas against 661 in Spark.

Words Cloud¶
In [25]:
cmap = LinearSegmentedColormap.from_list("", ['#f00','#7a1b0c'])
pandas_words = [(w, f) for w,f in zip(missing_pandas_df.category, missing_pandas_df.total)]
wordcloud = WordCloud(width = 1440, height = 190,
                background_color ='black',
                mode = "RGBA",
                colormap = cmap,
                min_font_size = 1,
                relative_scaling = 0.03,
                #stopwords = STOPWORDS,
                )
wordcloud.generate_from_frequencies(dict(pandas_words))
# show words cloud
plt.figure(figsize = (10, 10), dpi= 300, facecolor = None)
plt.imshow(wordcloud)
plt.margins(x=0, y=0)
plt.axis("off")
plt.show()

Nulls Data Merge¶


The merge function in pandas can be used to compare and combine the missing values identified by Spark and pandas. The function links rows of two dataframes based on one or more common columns, known as the "key" columns. It is possible to specify the type of join to be performed using the "how" parameter, which can take values such as "left", "right", "outer", or "inner".

Additionally, the merge function can handle duplicate column names by appending a suffix to the column names of one of the dataframes. This can be done using the "suffixes" parameter, which takes a tuple of two strings to be appended to the column names of the left and right dataframes, respectively.
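A minimal sketch of that pattern on hypothetical per-library counts (the frames spark_nulls and pandas_nulls here are illustrative, not the notebook's own variables):

```python
import pandas as pd

# Hypothetical null counts per library, keyed by column name.
spark_nulls = pd.DataFrame({"category": ["director", "date_added"],
                            "total": [1971, 661]})
pandas_nulls = pd.DataFrame({"category": ["director", "rating"],
                             "total": [1969, 10]})

# 'outer' keeps every category flagged by either library; 'suffixes'
# disambiguates the duplicate 'total' columns; fillna(0) fills in
# categories the other library never flagged.
merged = spark_nulls.merge(pandas_nulls, on="category", how="outer",
                           suffixes=("_spark", "_pandas")).fillna(0)
print(merged.shape)  # (3, 3)
```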

In [26]:
missing_spark_pandas = missing_spark_df.merge(missing_pandas_df, on='category', how='outer', suffixes=('_spark', '_pandas')).fillna(0)
missing_spark_pandas['difference'] = missing_spark_pandas.total_spark - missing_spark_pandas.total_pandas
missing_spark_pandas['total_cum_spark'] = missing_spark_pandas.total_spark.cumsum()
missing_spark_pandas['total_cum_pandas'] = missing_spark_pandas.total_pandas.cumsum()
missing_spark_pandas['diff_cum'] = missing_spark_pandas.difference.cumsum()
display(missing_spark_pandas)
category      total_spark  percentage_spark  total_pandas  percentage_pandas  difference  total_cum_spark  total_cum_pandas  diff_cum
director             1971            0.3161          1969             0.3158           2             1971              1969         2
date_added            661            0.1061            11             0.0018         650             2632              1980       652
cast                  571            0.0916           570             0.0914           1             3203              2550       653
country               478            0.0767           476             0.0764           2             3681              3026       655
rating                 12            0.0019            10             0.0016           2             3693              3036       657
release_year           10            0.0016             0             0.0000          10             3703              3036       667
listed_in               3            0.0005             0             0.0000           3             3706              3036       670
show_id                 2            0.0003             0             0.0000           2             3708              3036       672
title                   2            0.0003             0             0.0000           2             3710              3036       674
duration                2            0.0003             0             0.0000           2             3712              3036       676
type                    1            0.0002             0             0.0000           1             3713              3036       677

For each category in the dataset, the number and percentage of missing values are provided for both Spark and Pandas, together with the difference between the two. The cumulative totals and the cumulative difference are also calculated.

Based on the results, it appears that there are differences in the number of missing values found by Spark and Pandas for several categories. For example, for the "director" category, Spark found 2 more missing values compared to Pandas. For the "date_added" category, Pandas found 650 fewer missing values than Spark. These differences may indicate that the two libraries handle missing data differently.
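One plausible source of the 'date_added' gap is date parsing. This pandas-only sketch (hypothetical values, not the notebook's data) shows how a parser locked to the wrong format manufactures nulls, while format inference recovers every well-formed date:

```python
import pandas as pd

# Dates in "Month D, YYYY" form, plus one genuinely bad value.
raw = pd.Series(["September 9, 2019", "September 8, 2018", "not a date"])

# A parser expecting ISO dates coerces every row to NaT (a null)...
strict = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
# ...while format inference only fails on the truly unparsable value.
inferred = pd.to_datetime(raw, errors="coerce")

print(int(strict.isna().sum()))    # 3: every row became null
print(int(inferred.isna().sum()))  # 1: only the bad value
```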

So, we are going to work with the data imported using pandas.

The reasons are:

Advantages:

  • Lower number of missing values - Pandas has fewer missing values in most of the categories compared to Spark.
  • Lower number of columns with missing values - Pandas has fewer columns with missing values compared to Spark.

Caveats:

  • The choice of the best tool for data import depends on various factors such as the size of the dataset, computational resources, and the required processing power. For this dataset and these results, Pandas appears to be the better choice.

Plotly templates¶

Plotly templates are a great way to customize the look and feel of your plots. They let you change the colors, fonts, and other visual elements in one place, without styling every figure by hand. This is useful when you want to quickly create a plot that looks good and is easy to read.

In [27]:
import plotly.io as pio
pio.templates
Out[27]:
Templates configuration
-----------------------
    Default template: 'plotly'
    Available templates:
        ['ggplot2', 'seaborn', 'simple_white', 'plotly',
         'plotly_white', 'plotly_dark', 'presentation', 'xgridoff',
         'ygridoff', 'gridon', 'none']

Bar Plot¶

In [28]:
item5 = px.bar(missing_spark_pandas, x='category', y=['total_spark', 'total_pandas', 'difference'], orientation='v', barmode='group', color_discrete_sequence=reds2, template='plotly_dark')
item5 = update_style(item5, 'Total Null Values Comparison', 40*scale, 600*scale, 1200*scale)
item5.show()
[Grouped bar chart: 'Total Null Values Comparison', total_spark, total_pandas and difference per category]

Line Plot¶

In [29]:
item6 = px.line(missing_spark_pandas, x='category', y=['total_spark', 'total_pandas','difference'], color_discrete_sequence=reds2, template='presentation')
item6 = update_style(item6, 'Total Null Values Comparison', 40*scale, 600*scale, 1200*scale)
item6.show()
[Line plot: 'Total Null Values Comparison', total_spark, total_pandas and difference per category]

Comparing Pandas and Spark Data Processing: Key Findings and Insights¶


The bar and line graphs compare the number of missing values per column between the Spark and Pandas imports. The difference in the "date_added" column stands out, at 650: Pandas parsed the date format correctly, leaving only 11 missing values, while Spark's import produced 661. In this case, Pandas handled the date format better than Spark.

Line Plot¶

In [30]:
item7 = px.line(missing_spark_pandas, x='category', y=['total_cum_spark', 'total_cum_pandas'], color_discrete_sequence=reds2, template='presentation')
item7 = update_style(item7, 'Total Cumulative Null Values Comparison', 40*scale, 600*scale, 1400*scale)
item7.update_layout(template='presentation')
item7.show()
[Line plot: 'Total Cumulative Null Values Comparison', total_cum_spark vs. total_cum_pandas]

The line graph shows that the data imported through Spark accumulates 3713 null values in total, compared to 3036 for the data imported through Pandas. Most of this gap comes from the 'date_added' column, where Spark's import produced far more nulls. The graph clearly demonstrates the disparity between the two imports.

Words Cloud¶

In [31]:
word= list(missing_spark_df.category.values)+list(missing_pandas_df.category.values)
freq= list(missing_spark_df.total.values)+list(missing_pandas_df.total.values)
cmap = LinearSegmentedColormap.from_list("", ['#f00','#7a1b0c'])
words = [(w, f) for w,f in zip(word, freq)]
wordcloud = WordCloud(width = 1440, height = 249,
                background_color ='black',
                mode = "RGBA",
                colormap = cmap,
                min_font_size = 1,
                relative_scaling = 0.03,
                #stopwords = STOPWORDS,
                )
wordcloud.generate_from_frequencies(dict(words))
# show words cloud
plt.figure(figsize = (10, 10), dpi= 300, facecolor = None)
plt.imshow(wordcloud)
plt.margins(x=0, y=0)
plt.axis("off")
plt.show()

Impute Null Values¶


In many cases, data is collected and used solely for informational purposes, such as creating dashboards or visualizations. In such scenarios, it is not necessary to perform complex imputation methods, such as machine learning, to fill in missing values. Instead, a simple approach, such as removing rows with missing values or replacing them with a central tendency measure, can be sufficient.

This approach is known as simple imputation and is suitable for datasets where the goal is to provide a general overview rather than making predictions or performing in-depth analysis. In the case of Netflix movies, for example, missing values may be present in the data for various reasons, such as data entry errors or missing information from the source.

By removing rows with missing values or replacing them with a central tendency measure, such as the mean or median, we can still provide a general overview of the data and display it in a meaningful way. This approach is also quick and easy to implement, making it a practical solution for data that is used solely for informational purposes.

In [32]:
data = pandas_data.copy()

Nulls Values Matrix¶

In [33]:
item8 = px.imshow(data.isna(), color_continuous_scale=['red', 'black'],
                  template='plotly_dark', range_color=[0,1],
                    labels=dict(x="Columns", y="No. Record", color="Null Values"))

# Add color bar legend in range (0,1)
item8.update_layout(coloraxis_colorbar=dict(
    title="",
    tickvals=[0, 1],
    ticktext=["Not Nulls", "Nulls"],
    ticks="inside",
    thickness=15,
    tickfont=dict(size=15),
    outlinecolor="lightgray",
    outlinewidth=1
))
item8.update_layout(paper_bgcolor='rgba(0, 0, 0, 1)')

item8 = update_style(item8, 'Null Values Matrix', 40*scale,450, 1440*scale)
item8.show()
[Heatmap: 'Null Values Matrix', null cells marked per column across all records]
In [34]:
opt.paging = False
display(missing_pandas_df)
category      total  percentage
director       1969      0.3158
cast            570      0.0914
country         476      0.0764
date_added       11      0.0018
rating           10      0.0016
In [35]:
missing_types= data[missing_pandas_df.category].dtypes.reset_index()
missing_types.columns = ['category', 'data_type']
display(missing_types)
category      data_type
director      object
cast          object
country       object
date_added    object
rating        object

Exploring and Cleaning¶

In [36]:
opt.paging= True
display(data[data.director.isna()])
[DataFrame preview: first 10 of 1,969 rows where 'director' is NaN, all columns shown]
In [37]:
data['director'] = data['director'].fillna('no director')
display(data.director)
director
Richard Finn, Tim Maltby
no director
no director
no director
Fernando Lebrija
no director
Gabe Ibáñez
Rodrigo Toro, Francisco Schultz
no director
Henrik Ruben Genz
Showing 1 to 10 of 6,234 entries

The 'director' column was identified as having null values. As a simple and straightforward fix, the missing values were replaced with the string 'no director'.

In [38]:
display(data[data.cast.isna()])
[DataFrame preview: first 10 of 570 rows where 'cast' is NaN, all columns shown]
In [39]:
data['cast'] = data['cast'].fillna('no cast')
data.cast
Out[39]:
cast
Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson
Jandino Asporaat
Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle
Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen
Nesta Cooper, Kate Walsh, John Michael Higgins, Keith Powers, Alicia Sanz, Jake Borelli, Kid Ink, Yousef Erakat, Rebekah Graf, Anne Winters, Peter Gilroy, Patrick Davis
Alberto Ammann, Eloy Azorín, Verónica Echegui, Lucía Jiménez, Claudia Traisac
Antonio Banderas, Dylan McDermott, Melanie Griffith, Birgitte Hjort Sørensen, Robert Forster, Christa Campbell, Tim McInnerny, Andy Nyman, David Ryall
Fabrizio Copano
no cast
James Franco, Kate Hudson, Tom Wilkinson, Omar Sy, Sam Spruell, Anna Friel, Thomas Arnold, Oliver Dimsdale, Diana Hardcastle, Michael Jibson, Diarmaid Murtagh
Showing 1 to 10 of 6,234 entries

The 'cast' column was identified as having null values. As a simple and straightforward fix, the missing values were replaced with the string 'no cast'.

In [40]:
display(data[data.country.isna()])
[DataFrame preview: first 10 of 476 rows where 'country' is NaN, all columns shown]
In [41]:
data['country'] = data['country'].fillna('no country')
In [42]:
data.country
Out[42]:
country
United States, India, South Korea, China
United Kingdom
United States
United States
United States
Spain
Bulgaria, United States, Spain, Canada
Chile
United States
United States, United Kingdom, Denmark, Sweden
Showing 1 to 10 of 6,234 entries

The 'country' column was identified as having null values. As a simple and straightforward fix, the missing values were replaced with the string 'no country'.

In [43]:
display(data[data.date_added.isna()])
[DataFrame preview: first 10 of 11 rows where 'date_added' is NaN, all columns shown]

In the case of columns 'date_added' and 'rating', the null values represent a very small percentage of the total data, around 0.2%. Therefore, the most straightforward solution is to simply drop these rows and clean the data, rather than imputing or replacing these values with any other information. This approach reduces the risk of introducing inaccuracies or biases in the data, and helps to maintain the integrity of the information being presented in the dashboard. By removing the null values in these columns, the data will be more consistent and provide a clearer representation of the available information.
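The same reasoning can be sketched on a toy frame (hypothetical rows): compute each column's null share with isna().mean(), and drop the affected rows only when that share is small.

```python
import pandas as pd

# Toy frame where each column is 25% null (hypothetical values).
df = pd.DataFrame({"date_added": ["September 9, 2019", None,
                                  "September 8, 2018", "September 8, 2017"],
                   "rating": ["TV-PG", "TV-MA", None, "R"]})

# Share of nulls per column guides the decision: when it is small,
# dropping the affected rows costs little information.
share = df.isna().mean()
print(share.tolist())  # [0.25, 0.25]

cleaned = df.dropna(subset=["date_added", "rating"])
print(len(cleaned))  # 2 of 4 rows survive
```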

In [44]:
data.dropna(subset=['date_added', 'rating'], inplace=True)
data[['date_added','rating']]
Out[44]:
date_added           rating
September 9, 2019    TV-PG
September 9, 2016    TV-MA
September 8, 2018    TV-Y7-FV
September 8, 2018    TV-Y7
September 8, 2017    TV-14
September 8, 2017    TV-MA
September 8, 2017    R
September 8, 2017    TV-MA
September 8, 2017    TV-MA
September 8, 2017    R
Showing 1 to 10 of 6,214 entries

No Null values¶

In [45]:
count_missing = data.isna().sum().reset_index()
count_missing.columns = ['category', 'null values sum']
display(count_missing)
category        null values sum
show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
Showing 1 to 10 of 12 entries

Nulls Values Matrix¶

In [46]:
item9 = px.imshow(data.isna(), color_continuous_scale=['red', 'black'],
                  template='plotly_dark', range_color=[0,1],
                    labels=dict(x="Columns", y="No. Record", color="Null Values"))

# Add color bar legend in range (0,1)
item9.update_layout(coloraxis_colorbar=dict(
    title="",
    tickvals=[0, 1],
    ticktext=["Not Nulls", "Nulls"],
    ticks="inside",
    thickness=15,
    tickfont=dict(size=15),
    outlinecolor="lightgray",
    outlinewidth=1
))
item9.update_layout(paper_bgcolor='rgba(0, 0, 0, 1)')

item9 = update_style(item9, 'Null Values Matrix', 40*scale,450, 1440*scale)
item9.show()
[Heatmap: 'Null Values Matrix' after cleaning, no null cells remain]
In [47]:
data.to_csv('netflix_titles_clean.csv', index=False)

The visualization confirms that the dataset no longer contains null values and is almost ready for building the dashboard. In the matrix, the 0-to-1 color scale draws non-null cells in red and null cells in black, so a matrix with no black cells indicates a fully clean dataset.

Export to html¶


In [48]:
!jupyter nbconvert --to html --theme jupyterlab_miami_nights --output Netflix_Null_Values.html Netflix_Null_Values.ipynb
[NbConvertApp] Converting notebook Netflix_Null_Values.ipynb to html
[NbConvertApp] Writing 6189824 bytes to Netflix_Null_Values.html